Skip to content

fix(opencode): broaden transport errno coverage and stop retrying permanent DNS failures#1467

Merged
Astro-Han merged 3 commits into
devfrom
claude/i1105b-transport-errno
Jun 23, 2026
Merged

fix(opencode): broaden transport errno coverage and stop retrying permanent DNS failures#1467
Astro-Han merged 3 commits into
devfrom
claude/i1105b-transport-errno

Conversation

@Astro-Han

@Astro-Han Astro-Han commented Jun 22, 2026

Copy link
Copy Markdown
Owner

Summary

Broadens the stream transport-failure classifier and makes its retryability per-errno. Before, classifyStreamFailure recognized only 4 errno codes (ECONNRESET/ECONNREFUSED/ETIMEDOUT/UND_ERR_SOCKET) and hardcoded every transport disconnect as retryable. Two gaps:

  • Common transient codes (DNS, undici timeouts, broken pipe) fell through to UnknownError.
  • A permanent failure like ENOTFOUND (an unresolved host — usually a misconfigured base URL) was retried up to the cap, stalling the turn instead of surfacing the cause.

Changes:

  • stream-failure-classifier.ts: broaden TRANSPORT_CODES (EAI_AGAIN, EPIPE, ECONNABORTED, EHOSTUNREACH, ENETUNREACH, ENOTFOUND, UND_ERR_CONNECT_TIMEOUT, UND_ERR_HEADERS_TIMEOUT, UND_ERR_BODY_TIMEOUT) + a tiny message fallback (socket hang up / premature close) for disconnects with no errno code. An error carrying an HTTP statusCode is short-circuited up front — an HTTP response was received, so it is an API error classified by status downstream (the APICallError branch in fromError, which runs after this) and must not be reclassified as transport by a transport-coded cause or a transport phrase in its message. "terminated" is intentionally not in the fallback (on its own it is undici's generic wrapper; only transport when its cause carries a transport code).
  • Per-errno retryability: TransportDisconnect.retryable is now a boolean. ENOTFOUND is non-retryable; EAI_AGAIN (transient DNS) and all others stay retryable.
  • message-v2.ts fromError: propagate transport.retryable to APIError.isRetryable instead of hardcoding true.
  • retry.ts classifyRetry: transport_disconnect now honors isRetryable (removed from RETRY_TRANSIENT_KINDS, given its own branch), so a non-retryable transport error takes the existing "not retryable → stop" path. retrySignal.retryable also feeds the recorded transport failure, so run-observability's retry_safety reports provider_terminal_failure for ENOTFOUND via the existing fix(opencode): stop recommending auto-retry for terminal provider failures #1118 branch.

Why

Split out of the error-classification work (#1105 / #1123) as its own slice because it changes retry strategy, not just classification. ENOTFOUND from a misconfigured base URL should fail fast with the real cause, not retry-storm into a stall; and the narrow errno set was dropping recoverable transient disconnects into opaque UnknownError.

Related Issue

Refs #1105 (classification spine), #1123 (classify-then-render umbrella).

Human Review Status

Pending

Review Focus

  • The statusCode short-circuit: it must not drop a real raw transport error (raw transport errors carry code/cause.code and no HTTP status; only HTTP responses have statusCode).
  • classifyRetry now honors isRetryable for transport_disconnect — confirm existing transport errnos (all retryable) still retry; only ENOTFOUND stops.
  • Dropping "terminated" from the message fallback (kept consistent with the existing over-match guard).

Risk Notes

Behavior change (intentional, tested): ENOTFOUND no longer auto-retries. classifyRetry's transient-kind contract changed for transport_disconnect (now honors isRetryable); one existing test that asserted "transport_disconnect retries even when isRetryable is false" was updated to the new per-errno contract. No data migration. No UI.

How To Verify

bun test src/session/stream-failure-classifier.test.ts src/session/retry.test.ts test/session/message-v2.test.ts → 111 pass, 0 fail
bun test test/session src/session test/provider src/provider → 1328 pass, 4 skip, 1 todo, 0 fail
tsgo --noEmit (packages/opencode) → clean
codex review (xhigh, 3 rounds) → final fresh-eye pass: no P0/P1/P2/P3

Pinned: ENOTFOUND → non-retryable transport (incl. nested cause); EAI_AGAIN / UND_ERR_HEADERS_TIMEOUT → retryable; bare "socket hang up" → retryable via message fallback; HTTP error with a transport phrase or transport-coded cause stays classified by status; classifyRetry honors isRetryable for transport_disconnect both ways.

Screenshots or Recordings

N/A — backend transport-classification change, no UI.

Checklist

  • Type label — this PR carries exactly one of bug, enhancement, task, documentation. Type labels are author-added; the labeler bot does NOT assign them. Add the label in the GitHub UI, then tick this.
  • Routing labels — this PR carries at least one of app, ui, platform, harness, ci. The labeler bot assigns these on PR open based on changed paths. Confirm the bot's choice (or override if wrong), then tick this.
  • Priority label — this PR carries exactly one of P0, P1, P2, P3. The priority-triage bot suggests one on PR open. Confirm or override, then tick this.
  • Human Review Status above is set to Pending, Approved by @<reviewer>, or Not required: <reason> (default is Pending; "not required" is restricted to bot-authored low-risk PRs).
  • I linked the related issue, or stated in Summary why there is no issue.

https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi

Summary by CodeRabbit

  • Bug Fixes

    • Enhanced detection and classification of network transport disconnections with improved retry logic
    • Better identification of connection errors from system error codes and messages
    • Refined handling of connection failures to intelligently determine retry eligibility
  • Tests

    • Added comprehensive test coverage for transport error classification and retry behavior scenarios

…manent DNS failures

The stream failure classifier only recognized 4 errno codes
(ECONNRESET/ECONNREFUSED/ETIMEDOUT/UND_ERR_SOCKET) and hardcoded every
transport disconnect as retryable. Two gaps: common transient codes (DNS,
undici timeouts, broken pipe) fell through to UnknownError, and a permanent
failure like ENOTFOUND (an unresolved host — usually a misconfigured base URL)
was retried up to the cap, stalling the turn instead of surfacing the cause.

- stream-failure-classifier.ts: broaden TRANSPORT_CODES (EAI_AGAIN, EPIPE,
  ECONNABORTED, EHOSTUNREACH, ENETUNREACH, ENOTFOUND, UND_ERR_CONNECT_TIMEOUT,
  UND_ERR_HEADERS_TIMEOUT, UND_ERR_BODY_TIMEOUT) and add a tiny message
  fallback (socket hang up / premature close) for disconnects that arrive with
  no errno code. An error carrying an HTTP statusCode is short-circuited up
  front: an HTTP response was received, so it is an API error classified by
  status downstream and must not be reclassified as transport — neither by a
  transport-coded cause nor by a transport phrase in its message (fromError
  runs this classifier before the APICallError branch). "terminated" is
  intentionally NOT in the fallback — on its own it is undici's generic wrapper
  and is only transport when its cause carries a transport code.
- Make retryability per-errno: TransportDisconnect.retryable is now a boolean.
  ENOTFOUND is non-retryable (permanent DNS / misconfigured base URL); EAI_AGAIN
  (transient DNS) and all others stay retryable.
- message-v2.ts fromError: propagate transport.retryable to APIError.isRetryable
  instead of hardcoding true.
- retry.ts classifyRetry: transport_disconnect now honors isRetryable (removed
  from RETRY_TRANSIENT_KINDS, given its own branch) so a non-retryable transport
  error takes the existing "not retryable -> stop" path. retrySignal.retryable
  also feeds the recorded transport failure, so run-observability's retry_safety
  reports provider_terminal_failure for ENOTFOUND via the existing #1118 branch.

Tests: classifier (ENOTFOUND non-retryable incl. nested cause, EAI_AGAIN /
UND_ERR_HEADERS_TIMEOUT retryable, bare "socket hang up" message fallback, HTTP
error with a transport phrase or transport-coded cause stays unclassified);
classifyRetry (transport_disconnect honors isRetryable both ways); fromError
(ENOTFOUND -> non-retryable APIError, EAI_AGAIN -> retryable, APICallError with
"socket hang up" message + transport cause stays invalid_request).

Refs #1105 #1123

Claude-Session: https://claude.ai/code/session_015bW9JQSkuB156gkNQdxCzi
@Astro-Han Astro-Han added the bug Something isn't working label Jun 22, 2026
@coderabbitai

coderabbitai Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Warning

Review limit reached

@Astro-Han, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 31 minutes and 28 seconds. Learn how PR review limits work.

Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file).

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based credits.

🚦 How do rate limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan refill rate.

For paid Pro and Pro+ PR reviews, CodeRabbit uses rolling per-developer review limits. Reviews become available again as older review attempts age out of the rolling limit window.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: b21f0cbe-c607-407a-aa8a-2c3fc847c9a8

📥 Commits

Reviewing files that changed from the base of the PR and between fa32c5e and 772714a.

📒 Files selected for processing (1)
  • packages/opencode/src/session/message-v2.ts
📝 Walkthrough

Walkthrough

The PR refines transport-disconnect error classification by making TransportDisconnect.retryable a variable boolean (not always true), expanding the set of recognized transport error codes, and adding a classifyBareTransportMessage function for message-pattern-based detection. fromError is updated to use a new centralized transportDisconnectError helper, and classifyRetry gains a dedicated branch that gates retries on error.data.isRetryable for transport_disconnect errors instead of always retrying.

Changes

Transport-Disconnect Classification and Retry Refinement

Layer / File(s) Summary
TransportDisconnect type, code tables, and disconnect helper
packages/opencode/src/session/stream-failure-classifier.ts
TransportDisconnect.retryable changes from true literal to boolean. TRANSPORT_CODES is expanded with Undici/DNS codes. BARE_TRANSPORT_MESSAGES regex table is added. disconnect(code) helper sets retryable=false for ENOTFOUND and true for all other listed codes.
classifyStreamFailure and classifyBareTransportMessage
packages/opencode/src/session/stream-failure-classifier.ts, packages/opencode/src/session/stream-failure-classifier.test.ts
classifyStreamFailure short-circuits on numeric statusCode and uses disconnect() for per-code retryability on both top-level and cause-nested codes. New classifyBareTransportMessage matches trimmed error messages against BARE_TRANSPORT_MESSAGES patterns. Tests cover a full code/retryable matrix, edge cases, and bare-message anchoring.
fromError helper and transport-disconnect wiring
packages/opencode/src/session/message-v2.ts, packages/opencode/test/session/message-v2.test.ts
Adds transportDisconnectError(e, transport) helper that builds an APIError with isRetryable from transport.retryable. fromError delegates to this helper on classifyStreamFailure match and adds a bare-message fallback via classifyBareTransportMessage. Tests cover ENOTFOUND (non-retryable), EAI_AGAIN (retryable), bare "socket hang up", structured-body precedence, and HTTP-400 with UND_ERR_SOCKET cause.
classifyRetry transport_disconnect gating
packages/opencode/src/session/retry.ts, packages/opencode/src/session/retry.test.ts
transport_disconnect is removed from the blanket-transient retry set and given a dedicated branch that returns error.data.isRetryable. Tests verify isRetryable=true retries and isRetryable=false returns undefined.

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant fromError
  participant classifyStreamFailure
  participant classifyBareTransportMessage
  participant transportDisconnectError
  participant classifyRetry

  Caller->>fromError: fromError(e)
  fromError->>classifyStreamFailure: classifyStreamFailure(e)
  classifyStreamFailure-->>fromError: TransportDisconnect{code, retryable} | undefined

  alt transport code matched (e.g. EAI_AGAIN, ENOTFOUND)
    fromError->>transportDisconnectError: transportDisconnectError(e, transport)
    transportDisconnectError-->>fromError: APIError{isRetryable=transport.retryable}
  else falls through to bare-message check
    fromError->>classifyBareTransportMessage: classifyBareTransportMessage(e)
    classifyBareTransportMessage-->>fromError: TransportDisconnect | undefined
    alt bare message matched (e.g. "socket hang up")
      fromError->>transportDisconnectError: transportDisconnectError(e, bareTransport)
      transportDisconnectError-->>fromError: APIError{isRetryable=true}
    end
  end

  fromError-->>Caller: APIError
  Caller->>classifyRetry: classifyRetry(apiError)
  alt kind == transport_disconnect
    classifyRetry-->>Caller: apiError.data.isRetryable ? "unknown" : undefined
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Astro-Han/pawwork#1108: Introduces the ProviderFailureKind vocabulary and the initial transport_disconnect path in MessageV2.fromError that this PR refines with per-code retryability and bare-message matching.

Poem

🐇 Hoppity-hop through the socket maze,
Where ENOTFOUND gets no retry days,
EAI_AGAIN? We'll try once more!
Bare "socket hang up" — we've got a door.
Each errno code now knows its fate,
No blanket retries — we discriminate! 🌐

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main change: broadening transport error coverage and preventing retries of permanent DNS failures like ENOTFOUND.
Description check ✅ Passed The description is comprehensive and complete, covering all required sections including Summary, Why, Related Issue, Human Review Status, Review Focus, Risk Notes, How To Verify, and all checklist items are ticked.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch claude/i1105b-transport-errno

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@github-actions github-actions Bot added harness Model harness, prompts, tool descriptions, and session mechanics P2 Medium priority labels Jun 22, 2026

@github-actions github-actions Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested priority: P2 (includes non-doc, non-test paths outside the low-risk bucket).

P1/P0 are reserved for maintainer confirmation. Please relabel manually if this is a release blocker, security issue, data-loss risk, or updater/runtime failure.

Astro-Han added a commit that referenced this pull request Jun 23, 2026
…sking them (#1466)

Fixes the classification layer behind the "Connection lost" misreport for
provider billing/quota/auth failures (e.g. DeepSeek 402 "Insufficient Balance"),
which were classified as `unknown` with the nested provider reason dropped.

Change boundary (`packages/opencode/src/provider/error.ts` + tests):
- 402 -> `quota_exhausted`; strong/weak billing patterns (weak ones status-gated
  to 400/402/403 with no rate-limit signal; 429 and FreeUsageLimitError excluded).
- `message()` now parses the structured body before the reason-phrase
  early-return, so a custom SDK message ("API call failed") no longer suppresses
  the nested `{error:{message}}` reason.
- `parseStreamError` classifies auth/rate-limit/billing codes; a typed
  `{type:"error"}` envelope with an unhandled code upgrades to
  `APIError(kind="unknown")` so the frontend can read `code`/`responseBody`
  (a bare `{code}` body stays `UnknownError`).
- `looksTransientCode` is a deny-by-default exact allowlist
  (`resource_exhausted`, `unavailable`, `overloaded`, ...) instead of a substring
  scan, so terminal codes like `model_unavailable_for_account` are no longer
  retried in a loop.

Review follow-ups (GPT-Pro + CodeRabbit):
- P2 `message()` early-return fixed; 402 + non-reason-phrase regression test added.
- P2 billing over-match: guards pinned (rate-limit signal suppresses weak
  billing on a billing-shaped status; weak billing never overrides a 5xx).
- P3 transient substring -> exact allowlist; guards for `*_unavailable_for_*`
  (non-retryable) and gRPC `UNAVAILABLE` (retryable).
- P3 scope (split to a single narrow fix): kept as one coherent change — both
  entry points are the same defect and the typed-unknown fallback feeds the
  PR3 frontend decoder; the rationale is in the PR's Scope section.

Verification: provider+session suite 1337 pass / 0 fail; `tsgo --noEmit` clean;
`lint` clean; full CI green (43 checks).

Refs #1123, #1105. Part of the classify->passthrough->render series
(#1467 transport errno, #1468 run-incident terminal cause, #1469 frontend decode).

Residual: the per-kind error-card UX (PR4) is design-gated and deferred — see
docs/handoff/2026-06-23-error-render-pr4-handoff.md.
…d parsing

Review follow-up (P2 + P3):
- P2: classifyStreamFailure ran its `socket hang up` / `premature close`
  substring fallback before parseStreamError, so a structured stream error (e.g.
  invalid_prompt) carrying that phrase in its message or cause.body was
  mis-classified as a retryable transport disconnect — the user got a wrong retry
  instead of the real provider error. Split the bare-message fallback into
  classifyBareTransportMessage, anchor it to the whole trimmed message, and run
  it only AFTER parseStreamError fails. Errno-coded transport detection stays
  ahead of parsing (a definitive signal).
- P3: drop the single-element NON_RETRYABLE_TRANSPORT_CODES set for the simpler
  `retryable: code !== "ENOTFOUND"`.
- P3: make the classifier test a transport-code matrix (every errno asserts kind
  + retryability), plus anchored bare-message match / non-match cases.

Adds end-to-end fromError regressions: cause.body invalid_prompt + "socket hang
up" stays invalid_request (non-retryable); bare "socket hang up" stays a
retryable transport disconnect.

Verification: targeted suite 170 pass; session+provider 1351 pass / 0 fail;
tsgo --noEmit clean.

Claude-Session: https://claude.ai/code/session_012743rGkjEzqaUMKy2nvYMM

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@packages/opencode/src/session/message-v2.ts`:
- Around line 1469-1472: The `default:` switch case contains a variable
declaration for `bareTransport` without being wrapped in braces, which violates
the noSwitchDeclarations lint rule. Wrap the entire content of the default
clause (both the classifyBareTransportMessage call and the if statement checking
bareTransport) in curly braces to create a proper scope for the variable
declaration and resolve the lint error.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: a9d22f11-778a-4926-a64d-6058c4c79f0c

📥 Commits

Reviewing files that changed from the base of the PR and between f433411 and fa32c5e.

📒 Files selected for processing (6)
  • packages/opencode/src/session/message-v2.ts
  • packages/opencode/src/session/retry.test.ts
  • packages/opencode/src/session/retry.ts
  • packages/opencode/src/session/stream-failure-classifier.test.ts
  • packages/opencode/src/session/stream-failure-classifier.ts
  • packages/opencode/test/session/message-v2.test.ts

Comment thread packages/opencode/src/session/message-v2.ts
The default clause now declares `const bareTransport` directly at clause
scope (a lexical declaration introduced by the bare-transport fallback). Wrap
the clause in braces so the binding is scoped to the clause rather than the
whole switch — addresses CodeRabbit's noSwitchDeclarations nit. Pure scoping
change, no behavior difference.

Claude-Session: https://claude.ai/code/session_012743rGkjEzqaUMKy2nvYMM
@Astro-Han Astro-Han merged commit 430a213 into dev Jun 23, 2026
41 checks passed
@Astro-Han Astro-Han deleted the claude/i1105b-transport-errno branch June 23, 2026 09:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working harness Model harness, prompts, tool descriptions, and session mechanics P2 Medium priority

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant